Visualizing Deep Networks by Optimizing with Integrated Gradients
Understanding and interpreting the decisions made by deep learning models is
valuable in many domains. In computer vision, computing heatmaps from a deep
network is a popular approach for visualizing and understanding deep networks.
However, heatmaps that do not correlate with the network may mislead humans; hence,
how faithfully a heatmap explains the underlying deep network is crucial.
In this paper, we propose I-GOS, which
optimizes for a heatmap so that the classification scores on the masked image
would maximally decrease. The main novelty of the approach is to compute
descent directions based on the integrated gradients instead of the normal
gradient, which avoids local optima and speeds up convergence. Compared with
previous approaches, our method can flexibly compute heatmaps at any resolution
for different user needs. Extensive experiments on several benchmark datasets
show that the heatmaps produced by our approach are more correlated with the
decision of the underlying deep network than those of other state-of-the-art approaches.
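The core idea can be illustrated with a short sketch: optimize a low-resolution mask so that the class score on the masked image drops, using a gradient averaged over scaled masks as the descent direction. This is a minimal, hedged illustration in PyTorch; the baseline image, mask resolution, step sizes, and the L1 regularizer below are assumptions for exposition, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def integrated_gradient(model, image, baseline, mask, target_class, steps=20):
    # Average the gradient of the target-class score w.r.t. the mask along a path
    # of scaled masks; this averaged gradient serves as the descent direction.
    grad_sum = torch.zeros_like(mask)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        m = (alpha * mask).detach().requires_grad_(True)
        perturbed = image * m + baseline * (1.0 - m)          # masked image
        score = model(perturbed)[:, target_class].sum()
        score.backward()
        grad_sum += m.grad
    return grad_sum / steps

def optimize_heatmap(model, image, target_class, iters=30, lr=0.1, l1_weight=0.01):
    baseline = torch.zeros_like(image)      # assumed baseline (e.g. a zero or blurred image)
    mask = torch.ones(1, 1, 28, 28)         # low-resolution mask; its size sets the heatmap resolution
    for _ in range(iters):
        up = F.interpolate(mask, size=image.shape[-2:], mode="bilinear", align_corners=False)
        ig = integrated_gradient(model, image, baseline, up, target_class)
        ig = F.adaptive_avg_pool2d(ig, mask.shape[-2:])       # map gradient back to mask resolution
        # Step so the masked-image score decreases; the L1 term is an illustrative
        # sparsity regularizer, not the paper's exact objective.
        mask = (mask - lr * (ig + l1_weight * torch.sign(mask))).clamp(0.0, 1.0)
    return mask

Because the mask lives at its own resolution and is only upsampled for evaluation, heatmaps of arbitrary resolution fall out of the same procedure.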
Do we really need temporal convolutions in action segmentation?
Action classification has made great progress, but segmenting and recognizing
actions from long untrimmed videos remains a challenging problem. Most
state-of-the-art methods focus on designing temporal convolution-based models,
but the inflexibility of temporal convolutions and the difficulties in modeling
long-term temporal dependencies restrict the potential of these models.
Transformer-based models, with their flexible sequence modeling capabilities, have
recently been applied to various tasks. However, the lack of inductive bias and
the inefficiency of handling long video sequences limit the application of
Transformers to action segmentation. In this paper, we design a pure
Transformer-based model without temporal convolutions, called the Temporal
U-Transformer (TUT), which incorporates temporal sampling. The U-Transformer architecture
reduces complexity while introducing an inductive bias that adjacent frames are
more likely to belong to the same class, but the introduction of coarse
resolutions results in the misclassification of boundaries. We observe that the
similarity distribution between a boundary frame and its neighboring frames
depends on whether the boundary frame is the start or end of an action segment.
Therefore, we further propose a boundary-aware loss based on the distribution
of similarity scores between frames from attention modules to enhance the
ability to recognize boundaries. Extensive experiments show the effectiveness
of our model.
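To make the architectural idea concrete, here is a minimal sketch using standard PyTorch Transformer encoder layers: a frame sequence is encoded at full temporal resolution, pooled to a coarse resolution, processed again, and upsampled with a skip connection before frame-wise classification. The layer sizes, the single pooling stage, and the omission of the boundary-aware loss are simplifications for illustration, not the TUT model itself.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalUTransformerSketch(nn.Module):
    def __init__(self, in_dim, hidden, num_classes, heads=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.enc_hi = nn.TransformerEncoderLayer(hidden, heads, hidden * 2, batch_first=True)
        self.enc_lo = nn.TransformerEncoderLayer(hidden, heads, hidden * 2, batch_first=True)
        self.dec_hi = nn.TransformerEncoderLayer(hidden, heads, hidden * 2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats):                               # feats: (B, T, in_dim) frame features
        hi = self.enc_hi(self.proj(feats))                  # attention at full temporal resolution
        lo = F.avg_pool1d(hi.transpose(1, 2), kernel_size=2)   # temporal downsampling
        lo = self.enc_lo(lo.transpose(1, 2))                # cheaper attention over the short sequence
        up = F.interpolate(lo.transpose(1, 2), size=hi.shape[1],
                           mode="linear", align_corners=False)
        # Skip connection plus upsampling encodes the prior that adjacent frames share a class.
        out = self.dec_hi(hi + up.transpose(1, 2))
        return self.classifier(out)                         # (B, T, num_classes) frame-wise logits

The coarse path is where the quadratic attention cost is reduced; the boundary-aware loss described above would be added on top of the attention similarity scores.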
SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
As an important and challenging problem in computer vision, PAnoramic
Semantic Segmentation (PASS) gives complete scene perception based on an
ultra-wide angle of view. Prevalent PASS methods that take 2D panoramic images as
input focus on correcting image distortions but do not consider the 3D properties
of the original data. Consequently, their performance drops significantly when the
input panoramic images contain 3D disturbances. To be more
robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer
for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical
geometry knowledge. Specifically, a spherical geometry-aware framework is
proposed for PASS. It includes three modules, i.e., spherical geometry-aware
image projection, spherical deformable patch embedding, and a panorama-aware
loss, which take input images with 3D disturbance into account, add a
spherical geometry-aware constraint to the existing deformable patch embedding,
and account for the pixel density of the original data, respectively.
Experimental results on the Stanford2D3D Panoramic dataset show that SGAT4PASS
significantly improves performance and robustness, with an mIoU increase of
approximately 2%, and when small 3D disturbances occur in the data, the
stability of our performance improves by an order of magnitude. Our code and
supplementary material are available at https://github.com/TencentARC/SGAT4PASS.
Comment: Accepted by IJCAI 202
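As one concrete reading of the panorama-aware loss, pixels near the poles of an equirectangular panorama cover far less area on the sphere than pixels near the equator, so a density-aware loss can weight each image row by the corresponding solid angle. The sketch below uses the standard cosine-latitude weighting as an assumed stand-in; the abstract does not specify SGAT4PASS's exact formulation.

import torch
import torch.nn.functional as F

def panorama_aware_ce(logits, labels):
    # logits: (B, C, H, W) for an equirectangular panorama; labels: (B, H, W).
    B, C, H, W = logits.shape
    # Latitude of each image row, from +pi/2 at the top to -pi/2 at the bottom;
    # the pixel area on the sphere shrinks toward the poles as cos(latitude).
    lat = torch.linspace(torch.pi / 2, -torch.pi / 2, H, device=logits.device)
    weight = torch.cos(lat).clamp(min=1e-3).view(1, H, 1)
    per_pixel = F.cross_entropy(logits, labels, reduction="none")   # (B, H, W)
    return (per_pixel * weight).sum() / weight.expand_as(per_pixel).sum()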
Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation
Scene Graph Generation (SGG) aims to structurally and comprehensively
represent objects and their connections in images; it can significantly benefit
scene understanding and other related downstream tasks. Existing SGG models
often struggle to solve the long-tailed problem caused by biased datasets.
However, even if these models fit specific datasets well, they may struggle to
resolve unseen triples that are not included in the training set. Most methods
feed a whole triple into the model and learn its overall features via statistical
machine learning; such models have difficulty predicting unseen triples because
objects and predicates from the training set are recombined into novel triples
in the test set. In this work, we propose
a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen
triples and improve the generalisation capability of the SGG models. We propose
a Joint Feature Learning (JFL) module and a Factual Knowledge based Refinement
(FKR) module to learn object and predicate categories separately at the feature
level and align them with corresponding visual features so that the model is no
longer limited to matching whole triples. Moreover, since we observe that the
long-tailed problem also affects the generalization ability, we design a novel
balanced learning strategy, including a Character Guided Sampling (CGS) module
and an Informative Re-weighting (IR) module, to provide tailor-made learning
methods for each predicate according to its characteristics. Extensive experiments
show that our model achieves state-of-the-art performance. Specifically, TISGG
improves zR@20 (zero-shot recall) by 11.7% on the PredCls sub-task of the
Visual Genome dataset.
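The key design of learning objects and predicates separately, rather than scoring whole triples, can be sketched as two independent alignment heads in a shared visual-text space, so that familiar objects and predicates can be recombined into unseen triples at test time. The module names, dimensions, and losses below are illustrative assumptions; the abstract does not detail the internals of JFL or FKR.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparateAlignmentSketch(nn.Module):
    def __init__(self, vis_dim, txt_dim, joint_dim=256):
        super().__init__()
        self.obj_vis = nn.Linear(vis_dim, joint_dim)    # object region features
        self.prd_vis = nn.Linear(vis_dim, joint_dim)    # relation (union region) features
        self.obj_txt = nn.Linear(txt_dim, joint_dim)    # object label text embeddings
        self.prd_txt = nn.Linear(txt_dim, joint_dim)    # predicate label text embeddings

    def forward(self, obj_feat, prd_feat, obj_label_emb, prd_label_emb):
        # Objects and predicates are scored against their own label embeddings
        # independently, so the model never has to memorize whole triples.
        obj_logits = self.obj_vis(obj_feat) @ self.obj_txt(obj_label_emb).t()
        prd_logits = self.prd_vis(prd_feat) @ self.prd_txt(prd_label_emb).t()
        return obj_logits, prd_logits

def separate_losses(obj_logits, prd_logits, obj_gt, prd_gt):
    # Two independent classification losses replace a single whole-triple loss.
    return F.cross_entropy(obj_logits, obj_gt) + F.cross_entropy(prd_logits, prd_gt)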
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
The incredible generative ability of large-scale text-to-image (T2I) models
has demonstrated a strong capacity for learning complex structures and meaningful
semantics. However, relying solely on text prompts cannot fully exploit the
knowledge learned by the model, especially when flexible and accurate structural
control is needed. In this paper, we aim to "dig out" the capabilities that T2I
models have implicitly learned, and then explicitly use them to control the
generation at a finer granularity. Specifically, we propose to
learn simple and small T2I-Adapters to align internal knowledge in T2I models
with external control signals, while freezing the original large T2I models. In
this way, we can train various adapters according to different conditions, and
achieve rich control and editing effects. Further, the proposed T2I-Adapters
have attractive properties of practical value, such as composability and
generalization ability. Extensive experiments demonstrate that our T2I-Adapter
has promising generation quality and a wide range of applications.
Comment: Tech Report. GitHub: https://github.com/TencentARC/T2I-Adapte
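A minimal sketch of the adapter idea, assuming a typical latent-diffusion U-Net: a small convolutional network turns an external condition map (e.g. edges or pose) into multi-scale features that are added to the frozen model's encoder features, so only the adapter is trained. The channel widths, number of scales, and injection points below are assumptions, not the released T2I-Adapter architecture.

import torch
import torch.nn as nn

class TinyAdapterSketch(nn.Module):
    def __init__(self, cond_channels=3, widths=(320, 640, 1280)):
        super().__init__()
        blocks, in_ch = [], cond_channels
        for w in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, w, 3, stride=2, padding=1),   # downsample once per scale
                nn.SiLU(),
                nn.Conv2d(w, w, 3, padding=1),
            ))
            in_ch = w
        self.blocks = nn.ModuleList(blocks)

    def forward(self, cond):
        feats, x = [], cond
        for block in self.blocks:
            x = block(x)
            feats.append(x)            # one feature map per assumed U-Net encoder scale
        return feats

During training only the adapter receives gradients; the T2I model stays frozen, and each adapter feature is added to the matching encoder feature (unet_feats[i] + adapter_feats[i]). Because the adapters are small and independent of one another, several of them (e.g. one per condition type) can be composed at inference time.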
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
This paper presents a LoRA-free method for stylized image generation that
takes a text prompt and style reference images as inputs and produces an output
image in a single pass. Unlike existing methods that rely on training a
separate LoRA for each style, our method can adapt to various styles with a
unified model. However, this poses two challenges: 1) the prompt loses
controllability over the generated content, and 2) the output image inherits
both the semantic and style features of the style reference image, compromising
its content fidelity. To address these challenges, we introduce StyleAdapter, a
model that comprises two components: a two-path cross-attention module (TPCA)
and three decoupling strategies. These components enable our model to process
the prompt and style reference features separately and reduce the strong
coupling between the semantic and style information in the style references.
StyleAdapter can generate high-quality images that match the content of the
prompts and adopt the style of the references (even for unseen styles) in a
single pass, which is more flexible and efficient than previous methods.
Experiments demonstrate the superiority of our method over previous works.
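The two-path cross-attention idea can be sketched as two separate cross-attention operations, one attending to the prompt embeddings (content) and one to the style-reference embeddings (style), whose outputs are then combined. The fusion weight, dimensions, and the omission of the three decoupling strategies are assumptions for illustration; the abstract does not give TPCA's exact formulation.

import torch
import torch.nn as nn

class TwoPathCrossAttentionSketch(nn.Module):
    def __init__(self, dim, heads=8, style_weight=0.5):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_weight = style_weight

    def forward(self, hidden, text_emb, style_emb):
        # hidden: (B, N, D) generator tokens; text_emb / style_emb: (B, L, D) embeddings.
        text_out, _ = self.text_attn(hidden, text_emb, text_emb)      # content path
        style_out, _ = self.style_attn(hidden, style_emb, style_emb)  # style path
        # Keeping the two paths separate is what reduces the coupling between the
        # semantics and the style of the reference images.
        return hidden + text_out + self.style_weight * style_out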